    Accelerating Time Series Analysis via Processing using Non-Volatile Memories

    Time Series Analysis (TSA) is a critical workload for consumer-facing devices. Accelerating TSA is vital for many domains, as it enables the extraction of valuable information and the prediction of future events. The state-of-the-art algorithm in TSA is the subsequence Dynamic Time Warping (sDTW) algorithm. However, sDTW's computational complexity grows quadratically with the time series' length, which has two performance implications. First, the available data parallelism far exceeds what the small number of processing units in commodity systems (e.g., CPUs) can exploit. Second, sDTW is memory-bound because it 1) has low arithmetic intensity and 2) incurs a large memory footprint. To tackle these two challenges, we leverage Processing-using-Memory (PuM), performing in-situ computation where the data resides, using the memory cells themselves. PuM provides a promising solution to alleviate data movement bottlenecks and exposes immense parallelism. In this work, we present MATSA, the first MRAM-based Accelerator for Time Series Analysis. The key idea is to exploit magnetoresistive memory crossbars to enable energy-efficient and fast time series computation in memory. MATSA provides two key benefits: 1) it leverages the high levels of parallelism in the memory substrate by exploiting column-wise arithmetic operations, and 2) it significantly reduces data movement costs by performing computation using the memory cells. We evaluate three versions of MATSA to match the requirements of different environments (e.g., embedded, desktop, or HPC computing) based on MRAM technology trends. We perform a design space exploration and demonstrate that our HPC version of MATSA improves performance by 7.35x/6.15x/6.31x and energy efficiency by 11.29x/4.21x/2.65x over server-class CPU, GPU, and PNM architectures, respectively.
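    The quadratic dynamic program at the heart of sDTW is easy to make concrete. The sketch below is a minimal NumPy implementation (not MATSA's hardware mapping): it sweeps the O(m×n) cost matrix one column at a time, keeping only the previous column in memory, which is the column-wise structure a PuM substrate could operate on; the function names and the absolute-difference distance are illustrative assumptions.

```python
import numpy as np

def sdtw_min_cost(query: np.ndarray, series: np.ndarray) -> float:
    """Subsequence DTW: best alignment cost of `query` against any
    subsequence of `series`. O(m*n) time, O(m) space via column-wise updates."""
    m, n = len(query), len(series)
    prev = np.full(m + 1, np.inf)
    prev[0] = 0.0  # a match may start at any position in `series`
    best = np.inf
    for j in range(n):                       # one column of the DP matrix per step
        curr = np.full(m + 1, np.inf)
        curr[0] = 0.0
        for i in range(1, m + 1):
            cost = abs(query[i - 1] - series[j])
            curr[i] = cost + min(prev[i], prev[i - 1], curr[i - 1])
        best = min(best, curr[m])            # best match ending at column j
        prev = curr
    return best

q = np.array([1.0, 2.0, 3.0])
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(sdtw_min_cost(q, t))  # 0.0: the query occurs exactly inside the series
```

    Note the in-column dependency (curr[i-1]): practical accelerators typically restructure the recurrence (e.g., into anti-diagonal wavefronts) so that all cells updated in one step are independent.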

    GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping

    Nanopore sequencing is a widely used high-throughput genome sequencing technology that can sequence long fragments of a genome into raw electrical signals at low cost. Nanopore sequencing requires two computationally costly processing steps for accurate downstream genome analysis. The first step, basecalling, translates the raw electrical signals into nucleotide bases (i.e., A, C, G, T). The second step, read mapping, finds the correct location of a read in a reference genome. In existing genome analysis pipelines, basecalling and read mapping are executed separately. We observe in this work that such separate execution of the two most time-consuming steps inherently leads to (1) significant data movement and (2) redundant computations on the data, slowing down the genome analysis pipeline. This paper proposes GenPIP, an in-memory genome analysis accelerator that tightly integrates basecalling and read mapping. GenPIP improves the performance of the genome analysis pipeline with two key mechanisms: (1) in-memory fine-grained collaborative execution of the major genome analysis steps in parallel; (2) a new technique for early rejection of low-quality and unmapped reads, which stops the execution of genome analysis for such reads in a timely manner and thereby reduces wasted computation. Our experiments show that, for the execution of the genome analysis pipeline, GenPIP provides 41.6X (8.4X) speedup and 32.8X (20.8X) energy savings with negligible accuracy loss compared to state-of-the-art software genome analysis tools executed on a state-of-the-art CPU (GPU). Compared to a design that combines state-of-the-art in-memory basecalling and read mapping accelerators, GenPIP provides 1.39X speedup and 1.37X energy savings.
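    The early-rejection control flow can be illustrated with a toy Python model. The sketch below is not GenPIP's hardware design; the thresholds, chunk size, and stand-in basecalling/seeding functions are all made up for illustration. The point is the structure: basecall a read in chunks and abandon it as soon as quality or mapping evidence falls below a threshold, instead of basecalling everything and mapping afterwards.

```python
from typing import Optional

CHUNK = 500          # signal samples basecalled per step (illustrative)
MIN_QUALITY = 7.0    # mean base quality needed to keep a read (illustrative)

def basecall_chunk(signal_chunk) -> tuple[str, float]:
    # Toy stand-in for a neural basecaller: bin each sample into a base and
    # fake a confidence score. Real basecallers are deep networks.
    bases = "".join("ACGT"[int(s) % 4] for s in signal_chunk)
    quality = 10.0 - 5.0 * ((sum(signal_chunk) / len(signal_chunk)) % 1.0)
    return bases, quality

def seeds_hit_reference(prefix: str, reference: str) -> bool:
    # Toy seeding check: does any k-mer sampled from the prefix occur in the
    # reference? Real mappers query a minimizer/hash index instead.
    k = 11
    return any(prefix[i:i + k] in reference for i in range(0, len(prefix) - k, k))

def process_read(signal, reference: str) -> Optional[str]:
    bases = ""
    for start in range(0, len(signal), CHUNK):
        chunk_bases, quality = basecall_chunk(signal[start:start + CHUNK])
        if quality < MIN_QUALITY:
            return None  # early-reject: low-quality read, stop all further work
        bases += chunk_bases
        if len(bases) >= 2 * CHUNK and not seeds_hit_reference(bases, reference):
            return None  # early-reject: read is unlikely to map
    return bases         # survived all checks; hand off to full alignment
```

    In GenPIP, the corresponding steps execute collaboratively inside memory at fine (chunk) granularity; the sketch only shows why stopping early saves both computation and data movement.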

    NEON: Enabling Efficient Support for Nonlinear Operations in Resistive RAM-based Neural Network Accelerators

    Resistive Random-Access Memory (RRAM) is well suited to accelerate neural network (NN) workloads, as RRAM-based Processing-in-Memory (PIM) architectures natively support the highly parallel multiply-accumulate (MAC) operations that form the backbone of most NN workloads. Unfortunately, NN workloads such as transformers require support for non-MAC operations (e.g., softmax) that RRAM cannot provide natively. Consequently, state-of-the-art works either integrate additional digital logic circuits to support the non-MAC operations or offload the non-MAC operations to the CPU/GPU, resulting in significant performance and energy efficiency overheads due to data movement. In this work, we propose NEON, a novel compiler optimization that enables end-to-end execution of NN workloads in RRAM. The key idea of NEON is to transform each non-MAC operation into a lightweight yet highly accurate neural network. Approximating the non-MAC operations with neural networks provides two advantages: 1) we can exploit the key strength of RRAM, i.e., highly parallel MAC operations, to flexibly and efficiently execute non-MAC operations in memory; 2) we can simplify RRAM's microarchitecture by eliminating the additional digital logic circuits while reducing data movement overheads. Accelerating the non-MAC operations in memory enables NEON to achieve a 2.28x speedup compared to an idealized digital logic-based RRAM design. We analyze the trade-offs associated with the transformation and demonstrate feasible use cases for NEON across different substrates.
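    The core idea, replacing a non-MAC operation with a tiny network whose inference is itself (mostly) MACs, can be sketched in a few lines of NumPy. The example below approximates exp(x) on [-8, 0] (the kernel of a numerically stable softmax, where logits are shifted so the maximum maps to 0) with a one-hidden-layer ReLU network whose output layer is fit by a single least-squares solve. The width, input range, and fitting method are illustrative assumptions, not NEON's actual compiler transformation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target nonlinearity and fitting range.
x = np.linspace(-8.0, 0.0, 2048).reshape(-1, 1)
y = np.exp(x)

# Tiny one-hidden-layer ReLU network: random, fixed first-layer weights with
# ReLU "kinks" spread over the input range; only the output layer is learned.
H = 32                                    # hidden width (illustrative)
slopes = rng.choice([-1.0, 1.0], size=H)  # first-layer weights
kinks = rng.uniform(-8.0, 0.0, size=H)    # where each ReLU bends
b1 = -slopes * kinks                      # ReLU(s*x + b) bends at x = -b/s
hidden = np.maximum(0.0, x * slopes + b1) # MACs plus a cheap ReLU
hidden = np.hstack([hidden, np.ones_like(x)])   # bias feature
W2, *_ = np.linalg.lstsq(hidden, y, rcond=None) # fit output layer

approx = hidden @ W2                      # inference is again just MACs
print("max |error| on [-8, 0]:", float(np.abs(approx - y).max()))
```

    With a few dozen ReLU features, such a piecewise-linear fit of a smooth function is typically quite accurate, and both layers map directly onto the MAC arrays an RRAM crossbar already provides.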

    A Case for Transparent Reliability in DRAM Systems

    Today's systems have diverse needs that are difficult to address using one-size-fits-all commodity DRAM. Unfortunately, although system designers can theoretically adapt commodity DRAM chips to meet their particular design goals (e.g., by reducing access timings to improve performance, or by implementing system-level RowHammer mitigations), we observe that designers today lack sufficient insight into commodity DRAM chips' reliability characteristics to implement these techniques in practice. In this work, we make a case for DRAM manufacturers to provide increased transparency into key aspects of DRAM reliability (e.g., basic chip design properties, testing strategies). Doing so enables system designers to make informed decisions that better adapt commodity DRAM to modern systems' needs while preserving its cost advantages. To support our argument, we study four ways that system designers can adapt commodity DRAM chips to system-specific design goals: (1) improving DRAM reliability; (2) reducing DRAM refresh overheads; (3) reducing DRAM access latency; and (4) mitigating RowHammer attacks. We observe that adopting a solution for any of the four goals requires system designers to make assumptions about a DRAM chip's reliability characteristics. These assumptions discourage system designers from using such solutions in practice, due to the difficulty of both making and relying upon them. We identify DRAM standards as the root of the problem: current standards rigidly enforce a fixed operating point and provide no specification for how a system designer might explore alternative operating points. To overcome this problem, we introduce a two-step approach that reevaluates DRAM standards with a focus on transparency of DRAM reliability, so that system designers are encouraged to make the most of commodity DRAM technology for both current and future DRAM chips.
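    To see why transparency matters for goal (2), consider the overhead a designer could recover if retention behavior were specified: the fraction of time a DRAM rank is blocked by refresh is roughly tRFC/tREFI. A back-of-the-envelope sketch using typical DDR4 datasheet timings for an 8 Gb device (the "relaxed" case assumes a hypothetical 4x longer refresh interval that only a manufacturer-specified retention margin could justify):

```python
# Refresh busy fraction: refresh command duration over refresh interval.
def refresh_overhead(trfc_ns: float, trefi_ns: float) -> float:
    return trfc_ns / trefi_ns

tRFC, tREFI = 350.0, 7_800.0  # ns; typical DDR4 8 Gb values, 64 ms window
print(f"nominal : {refresh_overhead(tRFC, tREFI):.1%}")      # ~4.5% of rank time
print(f"relaxed : {refresh_overhead(tRFC, 4 * tREFI):.1%}")  # 4x tREFI -> ~1.1%
```

    The dependence on information only the manufacturer has is the point of the sketch: without a specified retention margin, a designer cannot safely lengthen tREFI and must pay the nominal overhead.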

    RevaMp3D: Architecting the Processor Core and Cache Hierarchy for Systems with Monolithically-Integrated Logic and Memory

    Recent nanotechnological advances enable the Monolithic 3D (M3D) integration of multiple memory and logic layers in a single chip with fine-grained connections. M3D technology provides significantly higher main memory bandwidth and shorter latency than existing 3D-stacked systems. We show, for a variety of workloads on a state-of-the-art M3D system, that the performance and energy bottlenecks shift from the main memory to the core and cache hierarchy. Hence, there is a need to revisit current core and cache designs, which have been conventionally tailored to tackle the memory bottleneck. Our goal is to redesign the core and cache hierarchy, given the fundamentally new trade-offs of M3D, to benefit a wide range of workloads. To this end, we take two steps. First, we perform a design space exploration of the core's and caches' key components. We highlight that in M3D systems, (i) removing the shared last-level cache leads to similar or larger performance benefits than increasing its size or reducing its latency; (ii) improving L1 latency has a large impact on performance; (iii) wider pipelines are increasingly beneficial; (iv) the performance impact of branch speculation and the pipeline frontend increases; and (v) current synchronization schemes limit parallel speedup. Second, we propose an optimized M3D system, RevaMp3D, in which (i) using the tight connectivity between logic layers, we efficiently increase pipeline width, reduce L1 latency, and enable fine-grained synchronization; and (ii) using the high-bandwidth and energy-efficient main memory, we alleviate the amplified energy and speculation bottlenecks by memoizing repetitively fetched, decoded, and reordered instructions and turning off the relevant parts of the core pipeline when possible. RevaMp3D provides, on average, 81% speedup, 35% energy reduction, and 12.3% smaller area compared to the baseline M3D system.
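    The frontend-memoization idea can be illustrated with a toy software model: cache the decoded form of recently executed instructions keyed by their PC, so that on a hit the fetch/decode path (in hardware, the corresponding frontend structures) can be skipped or power-gated. Everything below, including the capacity, LRU policy, and names, is an illustrative assumption, not RevaMp3D's actual microarchitecture.

```python
from collections import OrderedDict

class DecodedUopCache:
    """Toy model: memoize decoded micro-ops by PC, with LRU replacement."""
    def __init__(self, capacity: int = 4096):
        self.capacity = capacity
        self.cache: OrderedDict[int, tuple] = OrderedDict()  # PC -> micro-ops
        self.hits = self.misses = 0

    def lookup(self, pc: int, decode):
        if pc in self.cache:
            self.hits += 1
            self.cache.move_to_end(pc)      # LRU update; frontend stays off
            return self.cache[pc]
        self.misses += 1
        uops = decode(pc)                   # full fetch/decode path runs
        self.cache[pc] = uops
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least-recently-used entry
        return uops

decoded = DecodedUopCache(capacity=2)
fake_decode = lambda pc: (f"uop@{pc:#x}",)   # stand-in decoder
for pc in [0x40, 0x44, 0x40, 0x44, 0x40]:    # a hot two-instruction loop
    decoded.lookup(pc, fake_decode)
print(decoded.hits, decoded.misses)          # 3 hits, 2 misses
```

    In a hot loop, nearly every instruction hits, which is what would let a design in this style keep the fetch/decode/reorder structures turned off most of the time.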